Water Quality Analysis Data

Submitted By: GROUP 9

Anisha Siwas-025007

Sarthak Jain-025029

Tanya Goel-025034

Dataset Information: Access to safe drinking water is essential to health, a basic human right, and a component of effective health-protection policy. This dataset contains water quality metrics for 3276 different water bodies.

  1. ph: pH of water (0 to 14).

  2. Hardness: Capacity of water to precipitate soap in mg/L.

  3. Solids: Total dissolved solids in ppm.

  4. Chloramines: Amount of Chloramines in ppm.

  5. Sulfate: Amount of Sulfates dissolved in mg/L.

  6. Conductivity: Electrical conductivity of water in μS/cm.

  7. Organic_carbon: Amount of organic carbon in ppm.

  8. Trihalomethanes: Amount of Trihalomethanes in μg/L.

  9. Turbidity: Measure of the light-scattering (cloudiness) property of water in NTU.

  10. Potability: Indicates whether the water is safe for human consumption. Potable = 1, not potable = 0.

IMPORTING LIBRARIES
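A sketch of the imports this analysis typically needs (the exact set depends on the cells below; plotting libraries would be added for the visualisation sections):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.manifold import TSNE
# matplotlib / seaborn would also be imported here for the plots.
```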

DATA READING AND EXPLORATION

Checking the shape of the dataset: it contains 3276 rows and 10 columns.

Checking the first 5 rows of the dataset to get an overview of what it contains and which columns are present.

Checking how many unique values each column contains.

Describing the dataset: computing the mean, standard deviation, etc. for each column.
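The exploration steps above can be sketched as follows. A small synthetic frame stands in for the real data (which is read from CSV), with column names taken from the feature list:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the water-quality DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ph": rng.uniform(0, 14, 20),
    "Hardness": rng.uniform(100, 300, 20),
    "Potability": rng.integers(0, 2, 20),
})

print(df.shape)       # (rows, columns)
print(df.head())      # first 5 rows
print(df.nunique())   # unique values per column
print(df.describe())  # mean, std, quartiles, etc. per column
```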

Checking whether the dataset is balanced or not. The classes are not severely imbalanced.

Checking for data types of each feature.

Checking the data for null values.

Our dataset contains many null values across 3 different columns, so we will fill them with either the mean or the median of each column. We cannot drop these rows because too many values are missing. We fill nulls with the mean when a column has no outliers, and with the median when it does.
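The fill strategy can be sketched like this (the values and the column-to-strategy assignment here are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Conductivity": [400.0, np.nan, 420.0, 410.0],  # no outliers -> mean
    "Solids": [20000.0, 21000.0, np.nan, 90000.0],  # outlier -> median
})

# Mean imputation for the outlier-free column, median for the other.
df["Conductivity"] = df["Conductivity"].fillna(df["Conductivity"].mean())
df["Solids"] = df["Solids"].fillna(df["Solids"].median())
```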

DATA VISUALIZATION

FEATURE ENGINEERING

We categorised the ph column into three categories, using qcut so that each of the three bins, named l = low, m = medium and h = high, contains an equal count of observations.
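The binning step can be sketched as follows (synthetic pH values; with 9 distinct values, qcut puts 3 in each bin):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ph = pd.Series(rng.uniform(0, 14, 9))

# qcut cuts at quantiles, giving (approximately) equal-sized bins.
ph_cat = pd.qcut(ph, q=3, labels=["l", "m", "h"])
print(ph_cat.value_counts())
```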

After categorising the ph levels, we plotted the ratio of potable to non-potable samples for each ph category.

Renaming columns.

DISCOVERING STRUCTURE

USING t-SNE DIMENSIONALITY REDUCTION TO DISCOVER WHETHER THE DATA POSSESSES ANY STRUCTURE

Popping out the target variable in order to separate predictors and target.

Splitting the dataset into training and validation parts. We have split the dataset so that 80% is used for training the model and the remaining 20% is used to test it.
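The 80/20 split can be sketched as follows (synthetic predictors and target stand in for the real columns; `random_state` is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = pd.DataFrame({"ph": rng.uniform(0, 14, 100),
                  "Hardness": rng.uniform(100, 300, 100)})
y = pd.Series(rng.integers(0, 2, 100), name="Potability")

# test_size=0.2 leaves 80% for training, 20% for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```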

Visualising through t-SNE

Visualising through t-SNE in 3D
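Both t-SNE views can be sketched like this (random data stands in for the scaled predictors; the perplexity value is an assumption chosen for the small sample):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))  # stand-in for the predictor matrix

# 2-D embedding for the flat plot; n_components=3 gives the 3-D view.
emb2d = TSNE(n_components=2, perplexity=10,
             random_state=42).fit_transform(X)
emb3d = TSNE(n_components=3, perplexity=10,
             random_state=42).fit_transform(X)
print(emb2d.shape, emb3d.shape)
```

The embedded coordinates would then be passed to a scatter plot, coloured by Potability, to see whether the classes form separate clusters.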

DATA PIPELINING

Now we will be separating out categorical and numerical features.

There are no categorical features in the training dataset.

We will create two subsets of num_cols: one set will be imputed using the 'mean' strategy and the other using 'median'.

After inspecting the outliers for all the columns, the numerical columns for which we use the mean strategy are the features with few or no outliers, i.e. Conductivity, Organic_carbon and Turbidity; for the features with more outliers, i.e. ph, Hardness, Solids, Chloramines, Sulfate and Trihalomethanes, we use the median strategy.
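The grouping above, together with one common way to count outliers (the 1.5×IQR rule, an assumption about how the inspection was done), can be sketched as:

```python
import numpy as np
import pandas as pd

# Column subsets taken from the text above.
mean_cols = ["Conductivity", "Organic_carbon", "Turbidity"]
median_cols = ["ph", "Hardness", "Solids", "Chloramines",
               "Sulfate", "Trihalomethanes"]

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
print(iqr_outlier_count(s))
```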

CREATING PIPES

Instantiating pipeline object for processing numerical data. Impute = median.

Instantiating pipeline object for processing numerical data. Impute = mean.

TESTING THE PIPES

Feeding data to each pipe to see if it is working.

GATHERING THE PIPES INTO COLUMN TRANSFORMER

Collecting all pipes in a column transformer along with their column names. All pipes operate in parallel.
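A minimal sketch of the column transformer, with illustrative column subsets and a tiny frame to test it on:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

mean_cols = ["Conductivity"]   # illustrative subsets
median_cols = ["ph", "Solids"]

pipe_mean = Pipeline([("imp", SimpleImputer(strategy="mean")),
                      ("sc", StandardScaler())])
pipe_median = Pipeline([("imp", SimpleImputer(strategy="median")),
                        ("sc", StandardScaler())])

# Each pipe is applied to its own columns, side by side.
ct = ColumnTransformer([
    ("mean_pipe", pipe_mean, mean_cols),
    ("median_pipe", pipe_median, median_cols),
])

df = pd.DataFrame({"ph": [7.0, np.nan, 6.5],
                   "Solids": [20000.0, 21000.0, np.nan],
                   "Conductivity": [400.0, np.nan, 420.0]})
out = ct.fit_transform(df)
print(out.shape)
```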

TESTING THE COLUMN TRANSFORMER

Final Pipeline for transformation and modeling.

Training on the data using the final pipe.

Making predictions on the test data. We do not need to transform X_test separately, because the pipes take care of that.
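The final pipeline, fit, and prediction steps can be sketched together as follows (synthetic data; the classifier choice is an assumption, since the text does not name the model):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = pd.DataFrame({"ph": rng.uniform(0, 14, 50),
                  "Conductivity": rng.uniform(200, 700, 50)})
X.loc[::7, "ph"] = np.nan           # some missing values
y = rng.integers(0, 2, 50)

ct = ColumnTransformer([
    ("median_pipe",
     Pipeline([("imp", SimpleImputer(strategy="median")),
               ("sc", StandardScaler())]), ["ph"]),
    ("mean_pipe",
     Pipeline([("imp", SimpleImputer(strategy="mean")),
               ("sc", StandardScaler())]), ["Conductivity"]),
])

# Final pipe: transformation followed by the model.
final_pipe = Pipeline([("transform", ct),
                       ("model", LogisticRegression())])
final_pipe.fit(X, y)            # raw X: the pipe handles imputation/scaling
preds = final_pipe.predict(X)   # no separate transform needed at predict time
```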

Transforming y_test

Display pipeline as diagram.

Pipeline as text.

HYPERPARAMETER TUNING
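A common way to do this step is a grid search over the pipeline's parameters; a minimal sketch under assumed choices of model and grid (the notebook does not specify them), using the `<step>__<param>` naming convention:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))    # stand-in predictors
y = rng.integers(0, 2, 60)

pipe = Pipeline([("imp", SimpleImputer(strategy="median")),
                 ("model", RandomForestClassifier(random_state=42))])

# Pipeline-step parameters are addressed as "<step>__<param>".
grid = {"model__n_estimators": [50, 100],
        "model__max_depth": [3, None]}

search = GridSearchCV(pipe, grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```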